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Abstract 

In this paper, a new method is proposed for sparse PCA based on the re- 
cursive divide-and-conquer methodology. The main idea is to separate the 
original sparse PCA problem into a series of much simpler sub-problems, each 
having a closed-form solution. By recursively solving these sub-problems in 
an analytical way, an efficient algorithm is constructed to solve the sparse 
PCA problem. The algorithm only involves simple computations and is 
thus easy to implement. The proposed method can also be very easily ex- 
tended to other sparse PCA problems with certain constraints, such as the 
nonnegative sparse PCA problem. Furthermore, we have shown that the 
proposed algorithm converges to a stationary point of the problem, and its 
computational complexity is approximately linear in both data size and di- 
mensionality. The effectiveness of the proposed method is substantiated by 
extensive experiments implemented on a series of synthetic and real data 
in both reconstruction-error-minimization and data-variance-maximization 
viewpoints. 
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1. Introduction 

Principal component analysis (PCA) is one of the most classical and 
popular tools for data analysis and dimensionality reduction, and has a wide 
range of successful applications throughout science and engineering [T]. By 
seeking the so-called principal components (PCs), along which the data vari- 
ance is maximally preserved, PCA can always capture the intrinsic latent 
structure underlying data. Such information greatly facilitates many further 
data processing tasks, such as feature extraction and pattern recognition. 

Despite its many advantages, the conventional PCA suffers from the fact 
that each component is generally a linear combination of all data variables, 
and all weights in the linear combination, also called loadings, are typically 
non-zeros. In many applications, however, the original variables have mean- 
ingful physical interpretations. In biology, for example, each variable of gene 
expression data corresponds to a certain gene. In these cases, the derived PC 
loadings are always expected to be sparse (i.e. contain fewer non-zeros) so as 
to facilitate their interpretability. Moreover, in certain applications, such as 
financial asset trading, the sparsity of the PC loadings is especially expected 
since fewer nonzero loadings imply fewer transaction costs. 

Accordingly, sparse PCA has attracted much attention in the recent 
decade, and a variety of methods for this topic have been developed [2"H2"3]. 
The first attempt for this topic is to make certain post-processing transfor- 
mation, e.g. rotation [2] by Jolliffe and simple thresholding [3J by Cadima 
and Jolliffe, on the PC loadings obtained by the conventional PCA to en- 
force sparsity. Jolliffe and Uddin further advanced a SCoTLASS algorithm 
by simultaneously calculating sparse PCs on the PCA model with additional 
Zi-norm penalty on loading vectors pE]. Better results have been achieved 
by the SPCA algorithm of Zou et a!., which was developed based on it- 
erative elastic net regression [5]. D'Aspremont et al. proposed a method, 
called DSPCA, for finding sparse PCs by solving a sequence of semidefi- 
nite programming (SDP) relaxations of sparse PCA [6J. Shen and Huang 
developed a series of methods called sPCA-rSVD (including sPCA-rSVD; , 
sPCA-rSVD^, sPCA-rSVDsc'AD), computing sparse PCs by low-rank ma- 
trix factorization under multiple sparsity-including penalties [7\. Journee et 
al. designed four algorithms, denoted as GPower io , GPower^, GPower i0j?Ti , 
and GPower/ 1>m , respectively, for sparse PCA by formulating the issue as 
non-concave maximization problems with Iq- or Zi-norm sparsity-inducing 
penalties and extracting single unit sparse PC sequentially or block units 
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ones simultaneously [5]. Based on probabilistic generative model of PCA, 
some methods have also been attained [9HT2]. e.g. the EMPCA method 
derived by Sigg and Buhmann for sparse and/or nonnegative sparse PCA 
[9]. Sriperumbudur et al. provided an iterative algorithm called DCPCA, 
where each iteration consists of solving a quadratic programming (QP) prob- 
lem [T31 Hi] . Recently, Lu and Zhang developed an augmented Lagrangian 
method (ALSPCA briefly) for sparse PCA by solving a class of non-smooth 
constrained optimization problems [15]. Additionally, d'Aspremont derived 
a PathSPCA algorithm that computes a full set of solutions for all target 
numbers of nonzero coefficients [IS] , 

There are mainly two methodologies utilized by the current research on 
sparse PCA problem. The first is the greedy approach, including DSPCA [6], 
sPCA-rSVD EMPCA [9J, PathSPCA [IS], etc. These methods mainly 
focus on the solving of one-sparse-PC model, and more sparse PCs can be 
sequentially calculated on the deflated data matrix or data covariance [24J. 
Under this methodology, the first several sparse PCs underlying the data 
can generally be properly extracted, while the computation for more sparse 
PCs tends to be incrementally invalidated due to the cumulation of compu- 
tational error. The second is the block approach. Typical methods include 
SCoTLASS [1], GPower i0im , GPower Zl , m [g], ALSPCA [15], etc. These meth- 
ods aim to calculate multiple sparse PCs at once by utilizing certain block 
optimization techniques. The block approach for sparse PCA is expected to 
be more efficient than the greedy one to simultaneously attain multiple PCs, 
while is generally difficult to elaborately rectify each individual sparse PC 
based on some specific requirements in practice (e.g. the number of nonzero 
elements in each PC). 

In this paper, a new methodology, called the recursive divide-and-conquer 
(ReDaC briefly), is employed for solving the sparse PCA problem. The main 
idea is to decompose the original large and complex problem of sparse PCA 
into a series of small and simple sub-problems, and then recursively solve 
them. Each of these sub-problems has a closed-form solution, which makes 
the new method simple and very easy to implement. On one hand, as com- 
pared with the greedy approach, the new method is expected to integratively 
achieve a collection of appropriate sparse PCs of the problem by iteratively 
rectifying each sparse PC in a recursive way. The group of sparse PCs at- 
tained by the proposed method is further proved being a stationary solution 
of the original sparse PCA problem. On the other hand, as compared with 
the block approach, the new method can easily handle the constraints su- 
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perimposed on each individual sparse PC, such as certain sparsity and/or 
nonnegative constraints. Besides, the computational complexity of the pro- 
posed method is approximately linear in both data size and dimensionality, 
which makes it well-suited to handle large-scale problems of sparse PCA. 

In what follows, the main idea and the implementation details of the 
proposed method are first introduced in Section 2. Its convergence and com- 
putational complexity are also analyzed in this section. The effectiveness of 
the proposed method is comprehensively substantiated based on a series of 
empirical studies in Section 3. Then the paper is concluded with a summary 
and outlook for future research. Throughout the paper, we denote matrices, 
vectors and scalars by the upper-case bold-faced letters, lower-case bold-faced 
letters, and lower-case letters, respectively. 

2. The recursive divide-and-conquer method for sparse PCA 

In the following, we first introduce the fundamental models for the sparse 
PCA problem. 

2.1. Basic models of sparse PCA 

Denote the input data matrix as X = [xi, X2, . . . , x n ] T 6 M nxd , where 
n and d are the size and the dimensionality of the given data, respectively. 
After a location transformation, we can assume all {xj}™ =1 to have zero mean. 
Let S = -X T X e M. dxd be the data covariance matrix. 

n 

The classical PCA can be solved through two types of optimization models 
PQ. The first is constructed by finding the r(< <i)-dimensional linear subspace 
where the variance of the input data X is maximized [25]. On this data- 
variance-maximization viewpoint, the PCA is formulated as the following 
optimization model: 

maxTr(V T SV) s.t. V T V = I, (1) 

where Tr(A) denotes the trace of the matrix A and V = (vi, Vg, • • • , v r ) G 
M. dxr denotes the array of PC loading vectors. The second is formulated by 
seeking the r-dimensional linear subspace on which the projected data and 
the original ones are as close as possible [26J. On this reconstruction-error- 
minimization viewpoint, the PCA corresponds to the following model: 

min l|X-UV T lli s.t. V T V = I, (2) 
u,v 11 11 b 
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where ||A|| F is the Frobenius norm of A, V e ]R dxr is the matrix of PC 
loading array and U = (ui, u 2 , . . . , u r ) G ]R nxr is the matrix of projected 
data. The two models are intrinsically equivalent and can attain the same 
PC loading vectors [I]. 

Corresponding to the PCA models ([I]) and (|2), the sparse PCA problem 
has the following two mathematical formulationlr] 



maxTr(V T SV) s.t. vj v< = 1, ||v,|| p < U (i = 1, 2, . . . , r), (3) 



and 



min ||X-UV T || F s.t. vf Vi = 1, ||v 4 || p < t % (i = 1, 2, . . . , r), (4) 

where p = or 1 and the corresponding ||v|| p denotes the l - or the ly 
norm of v, respectively. Note that the involved Iq or li penalty in the above 
models ^ and tends to enforce sparsity of the output PCs. Methods 
constructed on Q include SCoTLASS 0], DSPCA % DCPCA 02 E], 
ALSPCA [IS], etc., and those related to Q include SPCA [5J, sPCA-rSVD 
[7], SPC [12], GPower [8], etc. In this paper, we will construct our method 
on the reconstruction-error-minimization model Q, while our experiments 
will verify that the proposed method also performs well based on the data- 
variance-maximization criterion. 

2.2. Decompose original problem into small and simple sub-problems 

The objective function of the sparse PCA model Q can be equivalently 
formulated as follows: 

2 



|X-UV T | 



F 



x - E i=1 "rf 



where Ej = X — XljyiUjvJ- ^ is then easy to separate the original large 
minimization problem, which is with respect to U and V, into a series of 
small minimization problems, which are each with respect to a column vector 
Uj of U and Vj of V for i — 1, 2, . . . , r, respectively, as follows: 

min ||Ei - u^vf ll!, s.t. vfvj = 1, ||vi|| p < U, (5) 



v, 



1 It should be noted that the orthogonality constraints of PC loadings in and ([2| 
are not imposed in ^ and Q. This is because simultaneously enforcing sparsity and 
orthogonality is generally a very difficult (and perhaps unnecessary) task. Like most of 
the existing sparse PCA methods [MS], we do not enforce orthogonal PCs in the models. 
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and 

min ||Ej - Ujvf II . (6) 

Through recursively optimizing these small sub-problems, the recursive divide- 
and-conquer (ReDaC) method for solving the sparse PCA model Q can then 
be naturally constructed. 

It is very fortunate that both the minimization problems in ([5j and ^ 
have closed-form solutions. This implies that the to-be-constructed ReDaC 
method can be fast and efficient, as presented in the following sub-sections. 

2.3. The closed-form solutions of pD and |6j) 

For the convenience of denotation, we first rewrite (|5| and (|6| as the 
following forms: 

min ||E-uv T ||^ s.t. v T v = 1, ||v|| p < t, (7) 

and 

|2 



mm 



E-uvHi;, (8) 



where u is n-dimensional and v is <i-dimensional. Since the objective function 
|| E — uv 7 )^ can be equivalently transformed as: 

||E - uv r ||^ = Tr((E - uv T ) r (E - uv T )) 

= || E ||| - 2Tr(E T uv T ) + Tr(vu T uv T ) 
= \\E\\ 2 F - 2u T Ev + u T uv r v, 

([7]) and (|8| are equivalent to the following optimization problems, respec- 
tively: 

max (E T u) T v s.t. v T v = 1, ||v|L < t, (9) 

V 

and 

min u T u-2(Ev) T u. (10) 

u 

The closed- form solutions of (|9| and (10), i.e. ^ and (Jsj) , can then be 
presented as follows. 

We present the closed-form solution to d8|) in the following theorem. 



Theorem 1. The optimal solution of M) is u*(v) = Ev. 
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The theorem is very easy to prove by calculating where the gradient of 
u T u — 2(Ev) T u is equal to zero. We thus omit the proof. 

In the p = case, the closed-form solution to ^ is presented in the follow- 
ing theorem. Here, we denote w = E T u, and hard\(w) the hard thresholding 
function, whose i-th element corresponds to I(\wi\ > A)u>j, where W{ is the 
z-th element of w and I(x) (equals 1 if x is ture, and otherwise) is the 
indicator function. The proof of the theorem is provided in Appendix A. 

Theorem 2. The optimal solution of 

max w T v s.t. v T v = 1, ||v||o < t, (11) 



is given by: 



t < 1, 



vS(w,t) = < C d Sl > k <t< k + ± (* = l,2,...,d-l), 

irir t>d, 
I! w|| 2 — 7 

where 9k denotes the k-th largest element of |w|. 

In the above theorem, denotes the empty set, implying that when t < 1, 



the optimum of (11) does not exist. 



In the p = 1 case, ^ has the following closed-form solution. In the 
theorem, we denote / W (A) = ^oftx{w)\\ > wnere so ft\( w ) represents the soft 
thresholding function sign(w)(\w\ — A) + , where (x) + represents the vec- 
tor attained by projecting x to its nonnegative orthant, and (ii, I 2 , . . . , Id) 
denotes the permutation of (1,2, ... ,d) based on the ascending order of 
|w| = (\uuil, \w 2 \, \w d \) T . 

Theorem 3. The optimal solution of 

max w T v s.t. v T v = 1, ||v||i < t, (12) 

V 

is given by: 

( 0, t < 1, 

* ( /w(\)> ll/w(KI)||i<t< ||/w(K_J)||i (* = 2 J 3,...,d-l), 

Vllw ' ] \ U(\\ ||/w(KI)||i<t< Vd, 
{ /w(o), t>Vd, 
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where for k = 1, 2, . . . , d — 1, 



^ _ (m - t 2 )(Er = i gO ~ Vt 2 (rn - t 2 )(m Eg g - Ei ^) 2 ) ^ 

m(m — t 2 ) 

where (ai,a 2) . . . , a m ) = {\w Ik \, \w Ik+1 \, \w Id \), m = d - k + 1. 

It should be noted that we have proved that ||/w(|w/ d _ 1 |)||i = 1 and 
||iw(A)||i is a monotonically decreasing function with respect to A in Lemma 
1 of the appendix. This means that we can conduct the optimum v*(w) of 
the optimization problem ^ for any w based on the above theorem. 

The ReDaC algorithm can then be easily constructed based on Theorems 

1-3. 

2.4- The recursive divide- and- conquer algorithm for sparse PCA 

The main idea of the new algorithm is to recursively optimize each col- 
umn, Uj of U or Vj of V for i = 1,2, ... ,r, with other u^s and VjS (j ^ i) 
fixed. The process is summarized as follows: 

• Update each column v, of V for i = 1,2, . . . ,r by the closed-form 
solution of ([5]) attained from Theorem 2 (for p = 0) or Theorem 3 (for 
p = l). 

• Update each column u« of U for i = 1,2, ...,r by the closed-form 
solution of (|6| calculated from Theorem 1. 

Through implementing the above procedures iteratively, U and V can be 
recursively updated until the stopping criterion is satisfied. We summarize 
the aforementioned ReDaC technique as Algorithm [1} 

We then briefly discuss how to specify the stopping criterion of the algo- 
rithm. The objective function of the sparse PCA model Q is monotonically 
decreasing in the iterative process of Algorithm 1 since each of the step 5 
and step 6 in the iterations makes an exact optimization for a column vector 
Uj of U or Vj of V, with all of the others fixed. We can thus terminate the 
iterations of the algorithm when the updating rate of U or V is smaller than 
some preset threshold, or the maximum number of iterations is reached. 

Now we briefly analyze the computational complexity of the proposed 
ReDaC algorithm. It is evident that the computational complexity of Algo- 
rithm [T] is essentially determined by the iterations between step 5 and step 
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Algorithm 1 ReDaC algorithm for sparse PCA 

Input: Data matrix X G R nxd , number of sparse PCs r, sparsity parameters 
t = (ti, . . . , t r ). 

1: Initialize U = (m, u 2 , . . . , u r ) G R nxr , V = (v 1; u 2 , . . . , v r ) G R dxr . 
2: repeat 

3: for i — 1, . . . r do 

4: Compute Ej = X - Y^j+i u i v J- 

5: Update Vj via solving ([5]) based on Theorem 2 (for p = 0) or 

Theorem 3 (for p = 1 ) . 
6: Update Uj via solving ^ based on Theorem 1. 

7: end for 

8: until stopping criterion satisfied. 
Output: The sparse PC loading vectors V = (v 1; v 2 , . . . , v r ). 



6, i.e. the calculation of the closed-form solutions of Vj and Uj of V and U, 
respectively. To compute u i; only simple operations are involved and the 
computation needs 0(nd) cost. To compute Vj, a sorting for the elements 
of the ^-dimensional vector |w| = |E T u| is required, and the total compu- 
tational cost is around 0{nd log d) by applying the well-known heap sorting 
algorithm [27]. The whole process of the algorithm thus requires around 
0(rnd log d) computational cost in each iteration. That is, the computa- 
tional complexity of the proposed algorithm is approximately linear in both 
the size and the dimensionality of input data. 

2.5. Convergence analysis 

In this section we evaluate the convergence of the proposed algorithm. 

The convergence of our algorithm can actually be implied by the mono- 
tonic decrease of the cost function of Q during the iterations of the al- 
gorithm. In specific, in each iteration of the algorithm, step 5 and step 6 
optimize the column vector Uj of U or Vj of V, with all of the others fixed, 
respectively. Since the objective function of Q is evidently lower bounded 
(> 0), the algorithm is guaranteed to be convergent. 

We want to go a further step to evaluate where the algorithm converges. 
Based on the formulation of the optimization problem Q, we can construct 
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a specific function as follows: 



/(Ui, . . . , U r , Vi, . . . , V r ) = / (Ui, . . . , U r , Vi, . . . , V r ) + ^ fi( V i)- ( 13 ) 



.Til 



x - E P , u * v * T 
< 'i=i 



where 

/ (ui,...,u r ,vi,...,v r ) = ||X-UV UF 

and for each of i — 1, . . . , r, /i(vj) is an indicator function defined as: 

if IKHp < U and vfvj = 1, 
oo, otherwise. 



It is then easy to show that the constrained optimization problem Q is 
equivalent to the unconstrained problem 



min /(u 1 ,...,u r ,vi,...,v r ). 

{Ui,Vi}^ =1 



(14) 



The proposed ReDaC algorithm can then be viewed as a block coordinate 
descent (BCD) method for solving (14) j2H], by alteratively optimizing Uj, Vj, 
% = l,2,...,r, respectively Then the following theorem implies that our 
algorithm can converge to a stationary point of the problem. 

Theorem 4 (|28j). Assume that the level set X° = {x : f(x) < f(x )} 
is compact and that f is continuous on X°. If /(u 1; . . . ,u r , v 1; . . . , v r ) is 
regular and has at most one minimum in each Uj and v, with others fixed 
fori = 1,2, ...,r, then the sequence (u 1; . . . , u r , v 1; . . . , v r ) generated by 
Algorithm 1 converges to a stationary point of f. 

In the above theorem, the assumption that the function /, as defined in 



(14), is regular holds under the condition that dom(fo) is open and fo is 
Gateaux-differentiable on dom(f ) (Lemma 3.1 under Condition Al in 
Based on Theorems 1-3, we can also easily see that /(u 1; . . . , u r , v 1; . . . , v r ) 
has unique minimum in each Uj and Vj with others fixed. The above theorem 
can then be naturally followed by Theorem 4.1(c) in [28J. 

Another advantage of the proposed ReDaC methodology is that it can be 
easily extended to other sparse PCA applications when certain constraints 
are needed for output sparse PCs. In the following section we give one of the 
extensions of our methodology — nonnegative sparse PCA problem. 
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2.6. The ReDaC method for nonnegative sparse PC A 

The nonnegative sparse PCA [29J problem differs from the conventional 
sparse PCA in its nonnegativity constraint imposed on the output sparse 
PCs. The nonnegativity property of this problem is especially important in 
some applications such as microeconomics, environmental science, biology, 
etc. [30J. The corresponding optimization model is written as follows: 



min ||X - UV T ||^ s.t. vfvj = 1, ||vi|| p < U, V; y (i = 1, 2, . . . , r), 

(15) 

where Vj >z means that each element of Vj is greater than or equal to 0. 

By utilizing the similar recursive divide-and-conquer strategy, this prob- 
lem can be separated into a series of small minimization problems, each with 
respect to a column vector Uj of U and Vj of V for i — 1, 2, . . . , r, respectively, 
as follows: 

min ||Ej - Ujvfll!, s.t. vfv* = 1, ||vj|| p < t h v, h (16) 



and 



min ||Ej - Ujvf ||^, , (17) 



where p — or 1. Since (17) is of the same formulation as ([6]), we only 



rewrite (16) as: 



need to discuss how to solve (16). For the convenience of denotation, we first 



min ||E-uv r ||p s.t. v T v = 1, ||v|| p < t, v h 0. (18) 



The closed- form solution of (18) is given in the following theorem. 



Theorem 5. The closed-form solution of (18) is v*((w) + ,t) (p = 0,1), 
where w = E T u, and Vq(-, •) and v*(-, •) are defined in Theorem 2 and The- 
orem 3, respectively. 



By virtue of the closed-form solution of (18) given by Theorem 5, we 



can now construct the ReDaC algorithm for solving nonnegative sparse PCA 



model (15). Since the algorithm differs from Algorithm 1 only in step 5 (i.e. 
updating of Vj), we only list this step in Algorithm 2. 

We then substantiate the effectiveness of the proposed ReDaC algorithms 
for sparse PCA and nonnegative sparse PCA through experiments in the next 
section. 
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Algorithm 2 ReDaC algorithm for nonnegative sparse PCA 



5: Update v, via solving (16) based on Theorem 5. 



3. Experiments 

To evaluate the performance of the proposed ReDaC algorithm on the 
sparse PCA problem, we conduct experiments on a series of synthetic and 
real data sets. All the experiments are implemented on Matlab 7.11(R2010b) 
platform in a PC with AMD Athlon(TM) 64 X2 Dual 5000+@2.60 GHz 
(CPU), 2GB (memory), and Windows XP (OS). In all experiments, the 
SVD method is utilized for initialization. The proposed algorithm under 
both p = and p = 1 was implemented in all experiments and mostly have 
a similar performance. We thus only list the better one throughout. 

3.1. Synthetic simulations 

Two synthetic data sets are first utilized to evaluate the performance 
of the proposed algorithm on recovering the ground-truth sparse principal 
components underlying data. 

3.1.1. Hastie data 

Hastie data set was first proposed by Zou et al. [5J to illustrate the 
advantage of sparse PCA over conventional PCA on sparse PC extraction. 
So far this data set has become one of the most frequently utilized benchmark 
data for testing the effectiveness of sparse PCA methods. The data set is 
generated in the following way: first, three hidden factors V\, V 2 and V 3 are 
created as: 

Vi ~ JV(0,290), V 2 ~ A^(0,300), V 3 = 0.3Vi + 0.925U 2 + e, 

where e ~ Af(0, 1), and V\, V 2 and e are independent; afterwards, 10 observ- 
able variables are generated as: 

X i = V 1 + s 1 i , i = 1,2,3,4, 
X l = V 2 + s 2 i , i = 5,6,7,8, 
X l = V 3 + el 2 = 9,10, 

where e\ ~ ftf(0, 1) and all els are independent. The data so generated 
are of intrinsic sparse PCs [5]: the first recovers the factor V 2 only using 
(X 5 , X 6 , X 7 , X 8 ), and the second recovers V\ only utilizing (X±, X 2 , X 3 , X±). 
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We generate 100 sets of data, each contains 1000 data generated in the 
aforementioned way, and apply Algorithm 1 to them to extract the first two 
sparse PCs. The results show that our algorithm can perform well in all 
experiments. In specific, the proposed ReDaC algorithm faithfully delivers 
the ground-truth sparse PCs in all experiments. The effectiveness of the 
proposed algorithm is thus easily substantiated in this series of benchmark 
data. 

3.1.2. Synthetic toy data 

As [7] and [8] , we adopt another interesting toy data, with intrinsic sparse 
PCs, to evaluate the performance of the proposed method. The data are gen- 
erated from the Gaussian distribution A/"(0, S) with mean and covariance 
£ G lR 10xl °, which is calculated by 

10 

Here, (ci, c 2 , c w ), the eigenvalues of the covariance matrix S, are pre- 
specified as (250,240,50,50,6,5,4,3,2,1), respectively, and (vi, v 2 , vio) 
are 10-dimensional orthogonal vectors, formulated by 

vi = (0.422, 0.422, 0.422, 0.422, 0, 0, 0, 0, 0.380, 0.380) T , 
v 2 = (0, 0, 0, 0, 0.489, 0.489, 0.489, 0.489, -0.147, 0.147) T , 

and the rest being generated by applying Gram-Schmidt orthonormalization 
to 8 randomly valued 10-dimensional vectors. It is easy to see that the data 
generated under this distribution are of first two sparse PC vectors Vi and 

Four series of experiments, each involving 1000 sets of data generated 
from Af(0,£), are utilized, with sample sizes 500, 1000, 2000, 5000, re- 
spectively. For each experiment, the first two PCs, vi and v 2 , are cal- 
culated by a sparse PCA method and then if both |vjVi| > 0.99 and 
|v 2 r v 2 | > 0.99 are satisfied, the method is considered as a success. The 
proposed ReDaC method, together with the conventional PCA and 12 cur- 
rent sparse PCA methods, including SPCA [5], DSPCA [6], PathSPCA [16], 
sPCA-rSVDi , sPCA-rSVD^, sPCA-rSVD S cAB 0, EMPCA % GPower; , 
GPower^, GPower Zo m , GPower^ [8] and ALSPCA [15], have been im- 
plemented, and the success times for four series of experiments have been 
recorded and summarized, respectively. The results are listed in Table [TJ 
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Table 1: Comparison of success times of PC A and different sparse PCA methods in syn- 
thetic toy experiments with sample size varying. The best results are highlighted in bold. 





n = 500 


n = 1000 


n = 2000 


n = 5000 


PCA 














SPCA 


566 


673 


756 


839 


DSPCA 


211 


203 


138 


62 


PathSPCA 


189 


187 


186 


171 


sPCA-rSVD io 


646 


702 


797 


906 


sPCA-rSVD ;i 


649 


715 


806 


909 


sPCA-rSVDgcAD 


649 


715 


806 


909 


EMPCA 


649 


715 


806 


909 


GPower/ 


155 


154 


155 


139 


GPower Zl 


122 


127 


126 


126 


GPower; m 


91 


76 


71 


16 


GPowerfj m 


90 


92 


88 


82 


ALSPCA 


669 


749 


826 


927 


ReDaC 


676 


748 


827 


928 



The advantage of the proposed ReDaC algorithm can be easily observed 
from Table [T] In specific, our method always attains the highest or second 
highest success times (in the size 1000 case, 1 less than ALSPCA) as com- 
pared with the other utilized methods in all of the four series of experiments. 
Considering that the ALSPCA method, which is the only comparable method 
in these experiments, utilizes strict constraints on the orthogonality of out- 
put PCs while the ReDaC method does not utilize any prior ground-truth 
information of data, the capability of the proposed method on sparse PCA 
calculation can be more prominently verified. 

3.2. Experiments on real data 

In this section, we further evaluate the performance of the proposed 
ReDaC method on two real data sets, including the pitprops and colon 
data. Two quantitative criteria are employed for performance assessment. 
They are designed in the viewpoints of reconstruction-error-minimization and 
data-variance-maximization, respectively, just corresponding to the original 
formulations Q and ^ for sparse PCA problem. 

• Reconstruction- error-minimization criterion: RRE. Once sparse PC 
loading matrix V is obtained by a method, the input data can then 
be reconstructed by X = UV T , where U = XV(V T V) _1 , attained by 
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the least square method. Then the relative reconstruction error (RRE) 
can be calculated by 

RRE = ~ II vll ~> 

to assess the performance of the utilized method in data reconstruction 
point of view. 

• Data-variance-maximization criterion: PEV. After attaining the sparse 
PC loading matrix V, the input data can then be reconstructed by 
X = XV(V r V) _1 V T , as aforementioned. And thus the variance of the 
reconstructed data can be computed by Tr(^-X T X). The percentage 
of explained variance (PEV, [7]) of the reconstructed data from the 
original one can then be calculated by 

to evaluate the performance of the utilized method in data variance 
point of view. 

3.2.1. Pitprops data 

The pitprops data set, consisting of 180 observations and 13 measured 
variables, was first introduced by Jeffers [3TJ to show the difficulty of inter- 
preting PCs. This data set is one of the most commonly utilized examples 
for sparse PCA evaluation, and thus is also employed to testify the effec- 
tiveness of the proposed ReDaC method. The comparison methods include 
SPCA [5], DSPCA ©, PathSPCA [16], sPCA-rSVD /o , sPCA-rSVD^, sPCA- 
rSVD SCAD [7J, EMPCA 0, GPower Zo , GPower^, GPower iom , GPower hm [| 
and ALSPCA [T5]. For each utilized method, 6 sparse PCs are extracted 
from the pitprops data, with different cardinality settings: 8-5-6-2-3-2 (al- 
together 26 nonzero elements), 7-4-4-1-1-1 (altogether 18 nonzero elements, 
as set in [5]) and 7-2-3-1-1-1 (altogether 15 nonzero elements, as set in [6]), 
respectively. In each experiment, both the RRE and PEV values, as defined 
above, are calculated, and the results are summarized in Table [2j Figure 
[T] further shows the the RRE and PEV curves attained by different sparse 
PCA methods in all experiments for more illumination. It should be noted 
that the GPower; 0i?n , GPower/ li?n and ALSPCA methods employ the block 
methodology, as introduced in the introduction of the paper, and calculate 
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Table 2: Performance comparison of different sparse PCA methods on pitprops data with 
different cardinality settings. The best result in each experiment is highlighted in bold. 



8-5-6-2-3-2(26) 7-4-4-1-1-1(18) 7-2-3-1-1-1(15) 





RRE 


PEV 


RRE 


PEV 


RRE 


PEV 


SPCA 


0.4162 


82.68% 


0.4448 


80.22% 


0.4459 


80.11% 


DSPCA 


0.4303 


81.48% 


0.4563 


79.18% 


0.4771 


77.23% 


PathSPCA 


0.4080 


83.35% 


0.4660 


80.11% 


0.4457 


80.13% 


sPCA-rSVD io 


0.4139 


82.87% 


0.4376 


80.85% 


0.4701 


77.90% 


sPCA-rSVD^ 


0.4314 


81.39% 


0.4427 


80.40% 


0.4664 


78.25% 


sPCA-rSVDgcAD 


0.4306 


81.45% 


0.4453 


80.17% 


0.4762 


77.32% 


EMPCA 


0.4070 


83.44% 


0.4376 


80.85% 


0.4451 


80.18% 


GPower; 


0.4092 


83.26% 


0.4400 


80.64% 


0.4457 


80.13% 


GPower ;i 


0.4080 


83.35% 


0.4460 


80.11% 


0.4457 


80.13% 


GPower; m 


0.4224 


82.16% 


0.5089 


74.10% 


0.4644 


78.44% 


GPowerij m 


0.4187 


82.46% 


0.4711 


77.81% 


0.4589 


78.94% 


ALSPCA 


0.4168 


82.63% 


0.4396 


80.67% 


0.4537 


79.42% 


ReDaC 


0.4005 


83.50% 


0.4343 


81.14% 


0.4420 


80.46% 



all sparse PCs at once while cannot sequentially derive different numbers of 
sparse PCs with preset cardinality settings. Thus the results of these meth- 
ods reported in Tableware calculated with the total sparse PC cardinalities 
being 26, 18 and 15, respectively, and are not included in Figure [Tj 

It can be seen from Table [2] that under all cardinality settings of the first 
6 PCs, the proposed ReDaC method always achieves the lowest RRE and 
highest PEV values among all the competing methods. This means that the 
ReDaC method is advantageous in both reconstruction-error-minimization 
and data- variance-maximization viewpoints. Furthermore, from Figure [TJ it 
is easy to see the superiority of the ReDaC method. In specific, for different 
number of extracted sparse PC components, the proposed ReDaC method 
can always get the smallest RRE values and the largest PEV values, as 
compared with the other utilized sparse PCA methods, in the experiments. 
This further substantiates the effectiveness of the proposed ReDaC method 
in both reconstruction-error-minimization and data-variance-maximization 
views. 

3.2.2. Colon data 

The colon data set (32] consists of 62 tissue samples with the gene ex- 
pression profiles of 2000 genes extracted from DNA micro-array data. This 
is a typical data set with high-dimension and low-sample-size property, and 
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Cardinality Setting: 8-5- 6-2-3-2 Cardinality Setting: 7-4-4-1-1-1 Cardinality Setting: 7-2- 3-1-1-1 




> ' ' ' ' — I ^ I 2U7 D ' ' ' ' 1 = 1 dU7 a ' ' ' ' 1 ~ 

123456 123456 123456 

Number of Sparse Principal Components Number of Sparse Principal Components Number of Sparse Principal Components 



Figure 1: The tendency curves of RRE and PEV with respect to the number of extracted 
sparse PCs attained by different sparse PCA methods on pitprops data. Three cardinality 
settings for the extracted sparse PCs are utilized, including 8-5-6-2-3-2, 7-4-4-1-1-1 and 
7-2-3-1-1-1. 

is always employed by sparse methods for extracting interpretable informa- 
tion from high-dimensional genes. We thus adopt this data set for evalua- 
tion. In specific, 20 sparse PCs, each with 50 nonzero loadings, are calcu- 
lated by different sparse PCA methods, including SPCA [5J, PathSPCA |16j . 
sPCA-rSVD io , sPCA-rSVD h , sPCA-rSVD 5CMD [7], EMPCA [9], GPower; , 
GPower^, GPower i() m , GPower^ m [8] and ALSPCA [IS], respectively. Their 
performance is compared in Tableland Figure [2] in terms of RRE and PEV, 
respectively. It should be noted that the DSPCA method has also been tried, 
while cannot be terminated in a reasonable time in this experiment, and thus 
we omit its result in the table. Besides, we have carefully tuned the parame- 
ters of the GPower methods (including GPower/ , GPower ii; GPower; m and 
GPower; lm ), and can get 20 sparse PCs with total cardinality around 1000, 
similar as the total nonzero elements number of the other utilized sparse PCA 
methods, while cannot get sparse PC loading sequences each with cardinality 
50 as expected. The results are thus not demonstrated in Figure [2} 

From Table [3j it is easy to see that the proposed ReDaC method achieves 
the lowest RRE and highest PEV values, as compared with the other 11 em- 
ployed sparse PCA methods. Figure [2] further demonstrates that as the num- 
ber of extracted sparse PCs increases, the advantage of the ReDaC method 
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Table 3: Performance comparison of different sparse PCA methods on colon data. The 
best results are highlighted in bold. 





SPCA 


PathSPCA 


sPCA-rSVD io 


sPCA-rSVD ;i 


RRE. 


0.7892 


0.5287 


0.5236 


0.5628 


PEV. 


37.72% 


72.05% 


72.58% 


68.32% 




sPCA-rSVD SCAD 


EMPCA 


GPower; 


GPower^ 


RRE. 


0.5723 


0.5211 


0.5042 


0.5076 


PEV. 


67.25% 


72.84% 


74.56% 


74.23% 




GPower/ m 


GPower; lm 


ALSPCA 


RcDaC 


RRE. 


0.4870 


0.4904 


0.5917 


0.4737 


PEV. 


76.29% 


75.95% 


64.99% 


77.56% 



1 I 1 1 1 1 1 1 1 1 1 1 90% 




2 4 6 8 10 12 14 16 18 20 2 4 6 8 10 12 14 16 18 20 

Number of Sparse Principal Components Number of Sparse Principal Components 



Figure 2: The tendency curves of RRE and PEV with respect to the number of extracted 
sparse PCs, each with cardinality 50, attained by different sparse PCA methods on colon 
data. 

tends to be more dominant than other methods, with respect to both the 
RRE and PEV criteria. This further substantiates the effectiveness of the 
proposed method and implies its potential usefulness in applications with 
various interpretable components. 

3.3. Nonnegative sparse PCA experiments 

We further testify the performance of the proposed ReDaC method (Algo- 
rithm 2) in nonnegative sparse PC extraction. For comparison, two existing 
methods for nonnegative sparse PCA, NSPCA [29] and Nonnegative EMPCA 
(N-EMPCA, briefly) [9], are also employed. 
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Table 4: Performance comparison of success times attained by PCA, NSPCA, N-EMPCA 
and ReDaC on synthetic toy experiments with different sample sizes. The best results are 
highlighted in bold. 





n = 500 


n = 1000 


n = 2000 


n = 5000 


PCA 














NSPCA 


739 


948 


933 


993 


N-EMPCA 


620 


655 


631 


639 


ReDaC 


835 


949 


978 


1000 



3.3.1. Synthetic toy data 

As the toy data utilized in Section 3.2, we also formulate a Gaussian 
distribution J\f(0, S) with mean and covariance matrix £ = J2]=i c j w jvJ G 
M 10xl °. Both the leading two eigenvectors of £ are specified as nonnegative 
and sparse vectors as: 



v 2 = (0, 0.140, 0, 0.840, 0, 0.280, 0, 0.140, 0, 0.420) 5 



vi = (0.474, 0, 0.158, 0, 0.316, 0, 0.791, 0, 0.158, 0) J , 

) 



and the rest are then generated by applying Gram-Schmidt orthonormal- 
ization to 8 randomly valued 10-dimensional vectors. The 10 corresponding 
eigenvalues (ci, C2, cio) are preset as (210, 190, 50, 50, 6, 5,4, 3, 2, 1), respec- 
tively. Four series of experiments are designed, each with 1000 data sets 
generated from Af(0, £), with sample sizes 500, 1000, 2000 and 5000, re- 
spectively. For each experiment, the first two PCs are calculated by the 
conventional PCA, NSPCA, N-EMPCA and ReDaC methods, respectively. 
The success times, calculated in the similar way as introduced in Section 



3.1.2[ of each utilized method on each series of experiments are recorded, as 
listed in Table gj 

From Table |4| it is seen that the ReDaC method achieves the highest 
success rates in all experiments. The advantage of the proposed ReDaC 
method on nonnegative sparse PCA calculation, as compared with the other 
utilized methods, can thus been verified in these experiments. 

3.3.2. Colon data 

The colon data set is utilized again for nonnegative sparse PCA calcula- 
tion. The NSPCA and N-EMPCA methods are adopted as the competing 
methods. Since the NSPCA method cannot directly pre-specify the cardi- 
nalities of the extracted sparse PCs, we thus first apply NSPCA on the colon 
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Table 5: Performance comparison of different nonnegative sparse PC A methods on colon 
data. The best results are highlighted in bold. 





NSPCA 


N-EMPCA 


RcDaC 


RRE 


0.3674 


0.3399 


0.2706 


PEV 


86.50% 


88.45% 


92.68% 




5 10 15 

Number of Nonnegative Sparse Principal Components 



30% 




5 10 15 

Number of Nonnegative Sparse Principal Components 



Figure 3: The tendency curves of RRE and PEV, with respect to the number of extracted 
nonnegative sparse PCs, attained by NSPCA, N-EMPCA and ReDaC on colon data. 



data (with parameters a = 1 x 10 6 and (3 = 1 x 10 7 ) and then use the car- 
dinalities of the nonnegative sparse PCs attained by this method to preset 
the N-EMPCA and ReDaC methods for fair comparison. 20 sparse PCs are 
computed by the three methods, and the performance is compared in Table 
[5] and Figure [3j in terms of RRE and PEV, respectively. 

Just as expected, it is evident that the proposed ReDaC method dom- 
inates in both RRE and PEV viewpoints. From Table [5j we can observe 
that our method achieves the lowest RRE and highest PEV on 20 extracted 
nonnegative sparse PCs than the other two utilized methods. Furthermore, 
Figure [3] shows that our method is advantageous, as compared with the other 
methods, for any preset number of extracted sparse PCs, and this advantage 
tends to be more significant as more sparse PCs are to be calculated. The 
effectiveness of the proposed method on nonnegative sparse PCA calculation 
can thus be further verified. 

3. 3. 3. Application to face recognition 

In this section, we introduce the performance of our method in face 
recognition problem [29]. The proposed ReDaC method, together with the 
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PCA 



NSPCA 



N-EMPCA 



ReDaC 




Figure 4: From top row to bottom row: 10 PCs or nonnegative sparse PCs extracted by 
PCA, NSPCA, N-EMPCA and ReDaC, respectively. 



Table 6: Performance comparison of different nonnegative sparse PCA methods on MIT 
CBCL Face Dataset #1. The best results are highlighted in bold. 





NSPCA 


N-EMPCA 


ReDaC 


RRE 


0.6993 


0.6912 


0.6606 


PEV 


51.10% 


52.22% 


56.36% 



conventional PCA, NSPCA and N-EMPCA methods, have been applied to 
this problem and their performance is compared in this application. The 
employed data set is the MIT CBCL Face Dataset #1, downloaded from 
"http:/ /cbcl.mit.edu/software-datasets/FaceData2.htmr. This data set con- 
sists of 2429 aligned face images and 4548 non-face images, each with reso- 
lution 19 x 19. For each of the four utilized methods, 10 PC loading vectors 
are computed on face images, as shown in Figure [4], respectively. For easy 
comparison, we also list the RRE and PEV values of three nonnegative sparse 
PCA methods in Table |U 

As depicted in Figure |4j the nonnegative sparse PCs obtained by the 
ReDaC method more clearly exhibit the interpretable features underlying 
faces, as compared with the other utilized methods, e.g. the first five PCs 
calculated from our method clearly demonstrate the eyebrows, eyes, cheeks, 
mouth and chin of faces, respectively. The advantage of the proposed method 
can further be verified quantitatively by its smallest RRE and largest PEV 
values, among all employed methods, in the experiment, as shown in Table 
|6j The effectiveness of the ReDaC method can thus be substantiated. 

To further show the usefulness of the proposed method, we apply it to face 
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Table 7: Performance comparison of the classification accuracy obtained by different non- 
negative sparse PCA methods. The best results are highlighted in bold. 



Face (%) Non-face (%) Total (%) 



LR 
PCA + LR 
NSPCA + LR 
N-EMPCA + LR 
ReDaC + LR 



96.71 93.57 94.47 

96.64 94.17 94.88 

94.89 93.49 93.89 

96.71 94.39 95.06 

96.78 94.46 95.84 



classification under this data set as follows. First we randomly choose 1000 
face images and 1000 non-face images from MIT CBCL Face Dataset #1, 
and take them as the training data and the rest images as testing data. We 
then extract 10 PCs by utilizing the PCA, NSPCA, N-EMPCA and ReDaC 
methods to the training set, respectively. By projecting the training data 
onto the corresponding 10 PCs obtained by these four methods, respectively, 
and then fitting the linear Logistic Regression (LR) (33] model on these 
dimension-reduced data (10-dimensional), we can get a classifier for testing. 
The classification accuracy of the classifier so obtained on the testing data 
is then computed, and the results are reported in Table [7| In the table, 
the classification accuracy attained by directly fitting the LR model on the 
original training data and testing on the original testing data is also listed 
for easy comparison. 

From Table [7j it is clear that the proposed ReDaC method attains the 
best performance among all implemented methods, most accurately recog- 
nizing both the face images and the non-face images from the testing data. 
This further implies the potential usefulness of the proposed method in real 
applications. 

4. Conclusion 

In this paper we have proposed a novel recursive divide-and-conquer 
method (ReDaC) for sparse PCA problem. The main methodology of the 
proposed method is to decompose the original large sparse PCA problem into 
a series of small sub-problems. We have proved that each of these decom- 
posed sub-problems has a closed-form global solution and can thus be easily 
solved. By recursively solving these small sub-problems, the original sparse 
PCA problem can always be very effectively resolved. We have also shown 
that the new method converges to a stationary point of the problem, and can 
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be easily extended to other sparse PCA problems with certain constraints, 
such as nonnegative sparse PCA problem. The extensive experimental results 
have validated that our method outperforms current sparse PCA methods 
in both reconstruction-error-minimization and data-variance-maximization 
viewpoints. 

There are many interesting investigations still worthy to be further ex- 
plored. For example, when we reformat the square L2-norm error of the 
sparse PCA model as the Li-norm one, the robustness of the model can 
always be improved for heavy noise or outlier cases, while the model is corre- 
spondingly more difficult to solve. By adopting the similar ReDaC method- 
ology, however, the problem can be decomposed into a series of much simpler 
sub-problems, which are expected to be much more easily solved than the 
original model. Besides, although we have proved the convergence of the 
ReDaC method, we do not know how far the result is from the global op- 
timum of the problem. Stochastic global optimization techniques, such as 
simulated annealing and evolution computation methods, may be combined 
with the proposed method to further improve its performance. Also, more 
real applications of the proposed method are under our current research. 

[1] I. T. Jolliffe, Principal Component Analysis, 2nd Edition, Springer, New 
York, 2002. 

[2] I. T. Jolliffe, Rotation of principal components - choice of normalization 
constraints, Journal of Applied Statistics 22 (1) (1995) 29-35. 

[3] J. Cadima, I. T. Jolliffe, Loadings and correlations in the interpretation 
of principal components, Journal of Applied Statistics 22 (2) (1995) 
203-214. 

[4] I. T. Jolliffe, N. T. Trendafilov, M. Uddin, A modified principal com- 
ponent technique based on the lasso, Journal of Computational and 
Graphical Statistics 12 (3) (2003) 531-547. 

[5] H. Zou, T. Hastie, R. Tibshirani, Sparse principal component analysis, 
Journal of Computational and Graphical Statistics 15 (2) (2006) 265- 
286. 

[6] A. d'Aspremont, L. El Ghaoui, M. I. Jordan, G. Lanckriet, A direct for- 
mulation for sparse pea using semidefinite programming, Siam Review 
49 (3) (2007) 434-448. 



23 



[7] H. P. Shen, J. Huang, Sparse principal component analysis via regular- 
ized low rank matrix approximation, Journal of Multivariate Analysis 
99 (6) (2008) 1015-1034. 

[8] M. Journee, Y. Nesterov, P. Richtarik, R. Sepulchre, Generalized power 
method for sparse principal component analysis, Journal of Machine 
Learning Research 11 (2010) 517-553. 

[9] C. Sigg, J. Buhmann, Expectation-maximization for sparse and non- 
negative pea, in: Proceedings of the 25th International Conference on 
Machine Learning, ACM, 2008, pp. 960-967. 

[10] Y. Guan, J. Dy, Sparse probabilistic principal component analysis, in: 
Proceedings of 12th International Conference on Artificial Intelligence 
and Statistics, 2009, pp. 185-192. 

[11] K. Sharp, M. Rattray, Dense message passing for sparse principal com- 
ponent analysis, in: Proceedings of 13th International Conference on 
Artificial Intelligence and Statistics, 2010, pp. 725-732. 

[12] C. Archambeau, F. Bach, Sparse probabilistic projections, in: D. Koller, 
D. Schuurmans, Y. Bengio, L. Bottou (Eds.), Advances in Neural Infor- 
mation Processing Systems 21, MIT Press, Cambridge, MA, 2009, pp. 
73-80. 

[13] B. Sriperumbudur, D. Torres, G. Lanckriet, Sparse eigen methods by dc 
programming, in: Proceedings of the 24th International Conference on 
Machine Learning, ACM, 2007, pp. 831-838. 

[14] B. K. Sriperumbudur, D. A. Torres, G. Lanckriet, A majorization- 
minimization approach to the sparse generalized eigenvalue problem, 
Machine Learning 85 (1-2) (2011) 3-39. 

[15] Z. Lu, Y. Zhang, An augmented lagrangian approach for sparse principal 
component analysis, Mathematical Programming 135 (1-2) (2012) 149- 
193. 

[16] A. d'Aspremont, F. Bach, L. Ghaoui, Full regularization path for sparse 
principal component analysis, in: Proceedings of the 24th International 
Conference on Machine Learning, ACM, 2007, pp. 177-184. 



24 



[17] B. Moghaddam, Y. Weiss, S. Avidan, Spectral bounds for sparse pea: 
Exact and greedy algorithms, in: Y. Weiss, B. Scholkopf, J. Piatt (Eds.), 
Advances in Neural Information Processing Systems 18, MIT Press, 
Cambridge, MA, 2006, pp. 915-922. 

[18] A. d'Aspremont, F. Bach, L. El Ghaoui, Optimal solutions for sparse 
principal component analysis, Journal of Machine Learning Research 9 
(2008) 1269-1294. 

[19] D. M. Witten, R. Tibshirani, T. Hastie, A penalized matrix decompo- 
sition, with applications to sparse principal components and canonical 
correlation analysis, Biostatistics 10 (3) (2009) 515-534. 

[20] A. Farcomeni, An exact approach to sparse principal component analy- 
sis, Computational Statistics 24 (4) (2009) 583-604. 

[21] Y. Zhang, L. E. Ghaoui, Large-scale sparse principal component analysis 
with application to text data, in: J. Shawe- Taylor, R. Zemel, P. Bartlett, 
F. Pereira, K. Weinberger (Eds.), Advances in Neural Information Pro- 
cessing Systems 24, MIT Press, Cambridge, MA, 2011, pp. 532-539. 

[22] D. Y. Meng, Q. Zhao, Z. B. Xu, Improve robustness of sparse pea by 
Zi-norm maximization, Pattern Recognition 45 (1) (2012) 487-497. 

[23] Y. Wang, Q. Wu, Sparse pea by iterative elimination algorithm, Ad- 
vances in Computational Mathematics 36 (1) (2012) 137-151. 

[24] L. Mackey, Deflation methods for sparse pea, in: D. Koller, D. Schu- 
urmans, Y. Bengio, L. Bottou (Eds.), Advances in Neural Information 
Processing Systems 21, MIT Press, Cambridge, MA, 2009, pp. 1017— 
1024. 

[25] H. Hotelling, Analysis of a complex of statistical variables into principal 
components, Journal of Educational Psychology 24 (1933) 417-441. 

[26] K. Pearson, On lines and planes of closest fit to systems of points in 
space, Philosophical Magazine 2 (7-12) (1901) 559-572. 

[27] D. Knuth, The Art of Computer Programming, Addison- Wesley, Read- 
ing, MA, 1973. 



25 



[28] P. Tseng, Convergence of a block coordinate descent method for nondif- 
ferentiable minimization, Journal of Optimization Theory and Applica- 
tions 109 (3) (2001) 475-494. 

[29] R. Zass, A. Shashua, Nonnegative sparse pea, in: B. Scholkopf, J. Piatt, 
T. Hoffman (Eds.), Advances in Neural Information Processing Systems 
19, MIT Press, Cambridge, MA, 2007, pp. 1561-1568. 

[30] A. Cichocki, R. Zdunek, A. Phan, S. Amari, Nonnegative Matrix and 
Tensor Factorizations: Applications to Exploratory Multi-way Data 
Analysis and Blind Source Separation, Wiley, 2009. 

[31] J. Jeffers, Two case studies in the application of principal component 
analysis, Applied Statistics 16 (1967) 225-236. 

[32] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, 
A. Levine, Broad patterns of gene expression revealed by clustering 
analysis of tumor and normal colon tissues probed by oligonucleotide 
arrays, Cell Biology 96 (12) (1999) 6745-6750. 

[33] J. Friedman, T. Hastie, R. Tibshirani, The Elements of Statistical Learn- 
ing, Springer, 2001. 



26 



Appendix A. Proof of Theorem 2 

In the following, we denote w = E T u, and hard\(w) the hard threshold- 
ing function, whose i-th element corresponds to I(\wi\ > \)wi, where Wi is 
the i-th element of w and I(x) (equals 1 if £ is ture, and otherwise) is the 
indicator function 

Theorem 2. The optimal solution of 

max w T v s.t. v T v = 1, ||v|| < t, 



0, t < 1, 




hardg k (w) 
\\hardg (w)|| 2 



W 2 



k < t < k + 1 (k = 1,2, ...,d- 1), 
t > d. 



where 9k denotes the k-th largest element of |w|. 

Proof. In case of t < 1, the feasible region of the optimization problem is 
empty, and thus the solution of the problem does not exist. 
In case of t > d, the problem is equivalent to 

max w T v s.t. v T v = 1. 

V 

It is then easy to attain the optimum of the problem v* 



W 2 



In case of A; < t < k + 1 (k = 1,2, . . . ,d — 1), the optimum v* of the 
problem is parallel to w on the /c-dimensional subspace where the first k 
largest absolute value of w are located. Also due to the constraint that v T v = 
1, it is then easy to deduce that the optimal solution of the optimization 

p roblem is pSwk- 

The proof is completed. 
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Appendix B. Proof of Theorem 3 

We denote I 2 , . . ■ , Id) the permutation of (1,2,..., d) based on the 
ascending order of |w| = |w 2 1 , • • • , |wd|) T , soft\{w) the soft threshold- 

ing function sign(w)(\w\ - A) + , / W (A) = and g> w (A) = w T / w (A) 

throughout the following. 
Theorem 3. The optimal solution of 

max w T v s.t. v T v = 1, ||v||i < t, 

V 



0, t < 1, 

/w(A fc ), tGlll/wCKDIIx.ll/wCK.xDIli) (k = 2,s,...,d-i) 

UK), te[||/w(KI)||i,Vrf), 

/w(o), t>Vd, 

where for k = 1,2 d—1, 




x _ (m - t 2 )(Er = i oQ - >/t 2 (m - *2)(m g E B5 55 

m(m — t z ) 

where (ai,o 2 , . . . , a m ) = (|w/J, |w/ fc+1 |, • • ■ , |wjj), m = d - + 1. 



Proof. For any v located in the feasible region of (12), it holds that 

\fd = V d~v T \ > || v ||i > V v r v = 1. 

We thus have that if t < 1, then the optimal solution v* does not exist since 
the feasible region of the optimization problem (9) is empty. 
If t > Vd, it is easy to see that <\12n is equivalent to 



and its optimum is 



rp rp 

max w v s.t. v v = 1, 



w 

= /w 0). 



|W|| 2 



We then discuss the case when t G [l,\/d). Firstly we deduce the 



monotonic decreasing property of /i w (A) = ||/ w (A)||i = h <, nxiw! \\ 
<?w(A) = w T / w (A) in A G (— oo, \wi d \) by the following lemmas. 



so/t A (w) 



and 

i 
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Lemma 1. /i w (A) is monotonically decreasing with respect to A in (—00, |tuj d |). 

Proof. First, we prove that h w (X) is monototically decreasing with A G 
[|w/ fc _J, \wi k \), k = 2,3, . . . ,d and (-00, \w h \). 

It is easy to see that for A G [|w/ fc _ 1 |, |tu/ fc |), k — 2, 3, . . . , d and (—00, \ wi 1 1), 



Then we have 

-(d-k + iVEUKI-A)^ £t fe (KI - A) 



/'w(A) 



EU(KI-A) 2 

(d \ ~ 3/2 / d / d 

£(M ~~ A) 2 j -(d - fc + 1) Y^wd - A) 2 + " A) 

i=fc / \ i=k \i=k 

It is known that for any number sequence si, S2, . . . , s n , it holds that 



^ n E 

,i=i / i=i 



* 2 . 



Thus we have 

/4(A) < 

for A G [|it;j fc _ 1 1 , 1), k = 2,3, ... ,d and (—00, liujj). Since h yv (X) is ob- 
viously a continuous function in (—00, |tu/ d |), it can be easily deduced that 
h w (X) is monotonically decreasing in the entire set (—00, \wi d \) with respect 
to A. 

The Proof is completed. 

Based on Lemma 1, It is easy to deduce that the range of h w (X) for 
A G (— 00, \wj d \) is [l,y/d), since lim /i w (A) = \fd and h w (X) = 1 for 

A e [K d _J, \w Id \). 

The following lemma shows the monotonic decreasing property of g w (A). 
Lemma 2. <? W (A) monotonically decreasing with respect to X G (—00, |iu/ d |). 
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Proof. Please see [22] for the proof. 

The next lemma proves that the optimal solution v* can be expressed as 
/ W (A*). 



Lemma 3. The optimal solution of (12) is of the expression v* = / W (A*) 
for t G [1, \fd) on some A* G (— oo, |w/J). 

Proof. Please see [19], [22] for the proof. 



Lemmas 1-3 imply that the optimal solution of ( 12 ) is attained at A* where 
||/w(A*)||i = t holds. The next lamma presents the closed-form solution of 
this equation. 

Lemma 4. The solutmon o/||/ w (A)||i = tfort G [||/ w (k/J) ||i, \\fA\ w h-i\)\\i] 
(fc = 2,3,...,d-l), ort G [||/w(KI)||i,>/d) is 



Xk _ (m - t 2 xEr=i go - yt*(m - gjgj EE g - (EE <o 2 ) ? 

m(m — t 2 ) 

where (a 1} a 2 , . . . , a m ) = (\w Ih \, \w Ik+1 \, • • • , \w Id \) and m = d — k + l. 
Proof. Let's transform the equation 

ii/.(a)ii.= p^tdzM = t (19) 

\/E?. t (KI - A) 2 VEi-.K-A) 2 

as the following expression 

m m 
i=l i=l 

Then we can get the quadratic equation with respect to A as: 

mm m 

m{m - t 2 )X 2 - 2(m - t 2 )(^ a,)A + a,) 2 - t 2 ^ a 2 = 0. (20) 

i=l i=l i=l 
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We first claim that t 2 < m for t G [||/ w (|m>jJ)||i, ll/wd^/^DIIi), k = 
2, 3, . . . , d — 1, or t G [||/w(l w /i|)||i) V^)- 111 fact, by the definition of / W (A), 
we have that 



* < ||/w(K k _J 



TZi( a i-\ w i k -i\y 



m, 



for i 6 [||/„(K|)||i, ||/ w (K_,|)||i), k = 2,...,d-l, and 



< 



Tti^-lv^l) 2 
dY" ( K-lv^l) 



for t e [||/w(|w/i I) ||i, Vd). Then it can be seen that the discriminant of 
equation (20) 

m m 

A = t 2 (m- t 2 )(m Y^a 2 -(J2 ^f) > 0, 



i=i 



i=i 



using the fact that (X^i a i) 2 — m SIii a i- Therefore, the solutions of 
equation (20) can be expressed as 

(m - t 2 )(Er=i Oi) ± >/* 2 ("»-* 2 )(^E™ia?-(E™iOi) 2 ) 



A 



m(m — i 2 ) 
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It holds that 

(m - i 2 )(£r=i a,) + v/t 2 (m-^)( mE ™ ia ^( E ™ iaj ) 2 ) 



> 



m{m — t 2 ) 



m(m — t 2 ) 
i=i a i t 2^i=k\ w h 



m d — k + 1 



If A + > |w/J, since A < \wi k \ required by equation (19), then 

, , _ _ (m - t 2 ){YZi gO - Vt 2 (m - t 2 )(m g - (Sg^g) 

Afc — A — -; "777 . 

m{m — t z ) 

Otherwise, if A + = \wj k \, then it holds that (J2iLi a i) 2 = m 2~^ili a i> which 
naturally leads to A& = A + = A - . 
The proof is then completed. 

Based on the above Lemmas 1-4, the conclusion of Theorem 3 can then 
be obtained. 
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Appendix C. Proof of Theorem 5 



Theorem 5. The global optimal solution to (18) is v*((w)+,£) (p = 0,1), 
where w = E T u, and Vq(-, ■) and v*(-, •) are defined in Theorem 2 and The- 
orem 3, respectively. 

It is easy to prove this theorem based on the following lemma. 

Lemma 5. Assume that there is at least one element of w is positive, then 
the optimization problem 

(PI) max w T v s.t. v T v = 1, ||v|| p < t, v >z 0, 

V 

can be equivalently soved by 

(-P2) max (w)Tv s.t. v T v = 1, ||v|L < t, 

V 

where p is or 1. 

Proof. Denote the optimal solutions of (PI) and (P2) as vl and v2, re- 
spectively. 

First, we prove that w T vl > w T v2. Based on Theorem 2 and 3, the 
elements of v2 are of the same signs (or zeros) with the corresponding ones 
of (w)_|_. This means that v2 y natrually holds. That is, v2 belongs 
to the feasible region of (PI). Since vl is the optimum of (PI), we have 
w T vl > w T v2. 

Then we prove that w T vl < w T v2 through the following three steps. 

(CI): The nonzero elements of vl —(v[ , v d ) lie on the positions 

where the nonnegative entries of w are located. 

If all elements of w are nonnegative, then (CI) is evidently satisfied. 

Otherwise, there is an element, denoted as the z-th element Wi of w, is 
negative and the corresponding element, v\ , of vl is nonzero (i.e. positive). 
We further pick up a nonnegative element, denoted as Wj, from w. Then we 
can construct a new d- dimensional vector v = (vi,v 2 , ■■■,v i ) as 

0, k = i, 



Then we have 
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W T V = ^WkVk = Wj\J (vf ] ) 2 + (vf ] ) 2 + ^ W k V k ] 
k k^i,j 

> WiV^p + WjVj 1 ^ + ^2 W k v< k^ 

= w T vl. 

We get the inequality by the fact that Wj\J (v • 1 ' ) ) 2 + (v^) 2 > WjV^ and 

> WiV^K This is contradict to the fact that vl is the optimal solution of 
(PI), noting that ||v|| p < ||v|| p < t. 
The conclusion (CI) is then proved. 

(C2): The nonzero elements of v2 ={v P, v<?\ ...,v^) lie on the positions 
where the nonzero entries of (w) + are located. 

Denote (w) + = (wf , ■■■,Wj). If all elements of (w)+ are positive, 
then (C2) is evidently satisfied. 

Otherwise, let be a zero element of w and the corresponding element, 
vf \ of v2 is nonzero, and let be a positive element of w. Then we can 
construct a new rf-dimensional vector v = (vi,V2, ■■■,Vi) as 

{0, k = i, 

^F) 2 + (v?) 2 , k = j, 

Then we have 



k k^i,j 

= (w)Jv2. 



We get the first inequality by the fact that w 




2) 



and = wfv\ . This is contradict to the fact that v2 is the optimal solution 
of (-P2), noting that ||v|| p < ||v|| p < t. 
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The conclusion (C2) is then proved. 

(C3): We can then prove that w T vl < w T v2 based on the conclusions 
(CI) and (C2) as follows: 



w T vl = (w)^vl < (w)^v2 = w T v2. 

In the above equation, the first equality is conducted by (CI), the second 
inequality is based on the fact that v2 is the optimal solution of (-P2), and 
the third equality is followed by (C2). 

Thus it holds that w T vl = w T v2. This implies that the optimization 
problem (PI) can be equivalently solved by (P2). 
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